While I am studying part time to become a data scientist, my full-time job is at The Boeing Company. There has recently been a lot of news coverage of Boeing surrounding the 737 MAX crashes, which has drawn attention to air travel and its safety.
I hope that I can provide some scientific, fact-based, data-based information to busy American travelers. Hopefully my research can provide conclusions that any adult reader can understand and find value in. This analysis will also strive to provide a historical lens through which to view the recent events. I will focus on United States air travel specifically.
In my analysis I plan to investigate the trends in aviation accidents and fatalities over time. I will also look at tangible factors that all readers can understand and use descriptive statistics and visualizations to bring light to patterns if they exist.
I did not come across any ethical concerns in finding this aviation data or utilizing it. It is federally licensed and distributed - open for public consumption. In fact, the largest data set is from the "ASIAS" project - Aviation Safety Information Analysis and Sharing. It appears the intention behind making this data available is to encourage transparency and analysis in the aviation industry.
There are several good reasons to perform a human-centered data science investigation into the safety of air travel. The first is that, with the recent media coverage surrounding the Boeing 737 MAX crashes, there is more journalism than ever on the topic. This journalism, however, is not always fact-based or unbiased. The more media coverage there is, the more difficult it becomes to understand what is actually true. I believe it has become increasingly difficult for American readers to sort through the information in front of them and separate the subjective from the objective. Many Americans are tired of this and want to get straight to the facts about air travel safety. I think my analysis could provide a means to do this.
Second, air travel has become increasingly popular over the past three decades, which we will see in my analysis below. Air travel is becoming more popular in the commercial sector specifically, as businesses depend on airplanes for international collaboration. On an individual level, recreational travel and tourism have become increasingly popular as prices for airplane tickets have dropped. This is a good reason to stop and reflect on where air travel safety has been, and where it is headed as demand increases in the coming decade.
Finally, perhaps the most important reason for a human-centered analysis of the topic: the laws of aerodynamics are complicated! So complicated that a graduate-level degree in an aeronautical field is required to truly understand how planes work. For a reader outside the aerospace industry, it is hard to bridge this gap and truly understand the issues that the industry faces. This gap in understanding of how planes work leads to a gap in trust between the customers and the products. If customers don't understand what factors make a plane safe or not, how can they make an informed decision? How will they know whether or not to trust this flying metal bird in the sky? For the future of the aerospace industry, we must make products our customers have confidence in! This is why this type of work is so important.
A Google search on the safety of flight turns up one article with some interesting statistics, which immediately point to the overwhelming safety of air travel in comparison to other modes of transportation. "There are a range of estimates out there, but based on its analysis of US Census data, it puts the odds of dying as a plane passenger at 1 in 205,552. That compares with odds of 1 in 4,050 for dying as a cyclist; 1 in 1,086 for drowning, and 1 in 102 for a car crash." (SBS). This article goes on to list some numbers about accidents; however, it only lists accidents from the US and Canada from 2013 to 2017. After that, the article looks at all countries, which my analysis does not. It appears to use US National Safety Council data, not FAA data.
Another top search result, a self-help blog for people with flying anxiety, also compares air travel to other modes of travel. It states that "In fact, based on this incredible safety record, if you did fly every day of your life, probability indicates that it would take you nineteen thousand years before you would succumb to a fatal accident. Nineteen thousand years!" An additional comparison to the dangers of driving points out that "a sold-out 727 jet would have to crash every day of the week, with no survivors, to equal the highway deaths per year in this country."
It points to outside sources for the following numbers:
| Death by | Your odds |
|---|---|
| Cardiovascular disease | 1 in 2 |
| Smoking (by/before age 35) | 1 in 600 |
| Car trip, coast-to-coast | 1 in 14,000 |
| Bicycle accident | 1 in 88,000 |
| Tornado | 1 in 450,000 |
| Train, coast-to-coast | 1 in 1,000,000 |
| Lightning | 1 in 1.9 million |
| Bee sting | 1 in 5.5 million |
| U.S. commercial jet airline | 1 in 7 million |
[Sources: Natural History Museum of Los Angeles County, Massachusetts Institute of Technology, University of California at Berkeley]
These numbers vary significantly from the previous article, which could have to do with the date that it was written. It could also simply be because these odds are hard to estimate. Not only do you need an appropriate accident count, you also need to appropriately estimate how often someone is flying or driving. For this very reason, I did not seek to incorporate other modes of transportation into my analysis. The data was simply not available or accurate. In any case, the blog does not dive deeply into accidents, pilots, or any sort of time series analysis.
Finally, probably the most relevant result I found was a Bloomberg article warning that flying has become more dangerous recently. This article touches on the same hot topic that I wish to address. While it does include one bar chart of total annual fatalities from 2010 to 2018, it does not include any other visuals. Even this visual is based on world passenger airline fatality data, not United States data. It does not appear to be the same data set as mine, and the article quickly moves on to other factors in airplane safety such as demand for speed and cost cutting by airlines, as well as an increased burden of safety regulation.
I am excited to say that despite the recent media craze, there don't seem to be many recent data-driven articles or publications on the safety of air travel that appeal to the general public. This means that my research could provide some real benefit to the general public. It does not appear that my work is directly building on anyone's previous work. Since FAA data is widely available, I would not be surprised if someone had already done this analysis; however, it was not published and popularized anywhere that I found.
In my attempts to simplify aviation jargon and confusing regulatory terminology, I made the table below to explain some of the terms that may be used throughout my project.
Throughout the rest of this analysis, the terms 'private' and 'public'/'commercial' will be used as shorthand for some of the distinctions in the definitions below.
| Term | Definition |
|---|---|
| NTSB | National Transportation Safety Board |
| 14 CFR-121 | Specification of Operating Requirements: Domestic, Flag, and Supplemental Operations, more than 10 passengers, scheduled service |
| Domestic Operation | Departure and Arrival Airport both within US |
| Flag Operation | Departure from US Airport, Arrival in non-US Airport |
| Supplemental Operation | Departure and Arrival from US Airports, cargo or large charter |
| Accident | An accident in these data sets means that an illegal act such as suicide, sabotage, or terrorism was NOT responsible for the occurrence. For example, the 9/11/2001 terrorism fatalities are excluded but are available in another data set. |
| Accident, Major | Major - an accident in which any of three conditions is met: 1) a Part 121 aircraft was destroyed, or 2) there were multiple fatalities, or 3) there was one fatality and a Part 121 aircraft was substantially damaged. |
| Accident, Serious | Serious - an accident in which at least one of two conditions is met: 1) there was one fatality without substantial damage to a Part 121 aircraft, 2) there was at least one serious injury and a Part 121 aircraft was substantially damaged. |
| Accident, Injury | Injury - a nonfatal accident with at least one serious injury and without substantial damage to a Part 121 aircraft. |
| Accident, Damage | Damage - an accident in which no person was killed or seriously injured, but in which any aircraft was substantially damaged. |
| Source | File Name | Table Name | Years | Data Source |
|---|---|---|---|---|
| 1 | faaAccidentIncidentDataSystem.csv | FAA Accident and Incident Data System (AIDS) | 1978 - 2015 | https://www.asias.faa.gov/apex/f?p=100:11:::NO::: |
| 2 | accidentsAccidentRates_scheduledPass.csv | Accidents and Accident Rates by NTSB Classification, 1995 through 2014, for U.S. Air Carriers Operating Under 14 CFR 121 | 1983 - 2014 | https://catalog.data.gov/dataset/accidents-and-accident-rates-by-ntsb-classification-1995-through-2014-for-u-s-air-carriers |
| 3 | accidentsFatalitiesRates_airlines.csv | Accidents, Fatalities, and Rates, 1995 through 2014, for U.S. Air Carriers Operating Under 14 CFR 121, Scheduled and Nonscheduled Service (Airlines) | 1983 - 2014 | https://catalog.data.gov/dataset/accidents-fatalities-and-rates-1995-through-2014-for-u-s-air-carriers-operating-under-14-c-dae36 |
| 4 | accidentsFatalitiesRates_genAv.csv | Accidents, Fatalities, and Rates, 1995 through 2014, U.S. General Aviation | 1975 - 2014 | https://catalog.data.gov/dataset/accidents-fatalities-and-rates-1995-through-2014-u-s-general-aviation |
There are multiple data sets that I plan to use in tandem to complete my analysis. The most detailed data set is the FAA Accident & Incident Data System, which contains a detailed record for every single accident and incident from ~1978 to ~2015 (with MM/DD/YYYY available for each) for any type of United States flight (private, commercial, scheduled, unscheduled, cargo, passenger). Examples of the relevant fields of data available for each row item are: Local Event Date, Event City, Event State, Event Airport, Event Type, Aircraft Damage, Flight Phase, Aircraft Make, Aircraft Model, Aircraft Series, Operator, Primary Flight Type, Total Fatalities, Total Injuries, PIC Certificate Type, PIC Flight Time Total Hrs, PIC Flight Time Total Make-Model.
The other data sets are at an aggregate level and may be more useful for those less familiar with aviation terminology. The second and third data sets are for scheduled passenger flights, so generally commercial airlines. The second source contains more detail on the severity of accidents and the third source contains more detail on the number of fatalities. They have several columns of redundant data, though, so I will likely join these two data sets, eliminate the redundant columns, and keep all unique columns so that I have the maximum granularity for these two data sets. The second data set contains the following fields: Year, Accidents: Major, Accidents: Serious, Accidents: Injury, Accidents: Damage, Aircraft hours flown (millions), Accidents per Million Hours Flown: Major, Accidents per Million Hours Flown: Serious, Accidents per Million Hours Flown: Injury, Accidents per Million Hours Flown: Damage.
The third data set contains the following fields: Year, Accidents: All, Accidents: Fatal, Fatalities: Total, Fatalities: Aboard, Flight Hours, Miles Flown, Departures, Accidents per 100,000 Flight Hours: All, Accidents per 100,000 Flight Hours: Fatal, Accidents per 1,000,000 Miles Flown: All, Accidents per 1,000,000 Miles Flown: Fatal, Accidents per 100,000 Departures: All, Accidents per 100,000 Departures: Fatal.
The fourth data set is general aviation, so it includes private and personal flights, not just scheduled commercial passenger flights. Therefore it includes more types of planes, smaller planes, and more airports. That information is not detailed in this data set, but on the whole it represents far more flight hours and far more total accidents. It contains the fields: Year, Accidents: All, Accidents: Fatal, Fatalities: Total, Fatalities: Aboard, Flight Hours, Accidents per 100,000 Flight Hours: All, Accidents per 100,000 Flight Hour: Fatal.
For each research question proposed, I will now explain my methodology for this analysis and presentation to the reader.
For this, I will use data sources 2 & 3 to plot the aggregate totals either as a line or bar chart. Source 2 will be for accidents and source 3 will be for fatalities. I will make one plot for fatalities and one plot for accidents. For accidents, I will either plot multiple lines in different colors to denote the different types of accident totals, or use a stacked bar chart. For fatalities I will also use a line or bar chart.
For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross miles flown per year as a line or bar chart. This will bring to light the huge spike in air travel in the past 30 years. After this, I will plot accidents and fatalities as a rate per mile flown each year. This will show that, proportional to the increase in air travel over the years, the likelihood of being in a plane that is involved in an accident has gone down dramatically.
For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross annual flight hours as a line or bar chart. This will bring to light the huge spike in air travel in the past 30 years. After this, I will plot accidents and fatalities as a rate per annual flight hour. This will show that, proportional to the increase in air travel over the years, the likelihood of being in a plane that is involved in an accident has gone down dramatically.
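The rate normalization described above can be sketched as follows. This is a minimal illustration, assuming a small hypothetical frame whose column names mirror sources 2 and 3; the numbers are invented and the helper `rate_per` is not part of the original notebook.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the NTSB tables.
toy = pd.DataFrame({
    'Year': [2000, 2001, 2002],
    'Accidents, All': [50, 40, 30],
    'Miles Flown': [5_000_000_000, 6_000_000_000, 7_500_000_000],
    'Flight Hours': [10_000_000, 12_000_000, 15_000_000],
}).set_index('Year')

def rate_per(events, exposure, per):
    """Return events per `per` units of exposure (e.g. per 1,000,000 miles)."""
    return per * events / exposure

# The same accident counts, normalized by two different measures of exposure
toy['Accidents per 1,000,000 Miles'] = rate_per(toy['Accidents, All'], toy['Miles Flown'], 1_000_000)
toy['Accidents per 100,000 Hours'] = rate_per(toy['Accidents, All'], toy['Flight Hours'], 100_000)
print(toy)
```

Dividing by exposure is what makes years with very different traffic volumes comparable; the raw counts alone cannot show whether flying became safer.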
After addressing parts 1a-1c for just commercial flights, I will turn to table 4, which contains the same accident, fatality, and rate data but for general aviation in the United States. I will use this to compare the relative safety of commercial flight vs. general aviation (including private and personal flights). I will choose just one metric, accidents per flight hour, and plot both the commercial and general aviation lines on the same plot in different colors.
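Because the commercial and general aviation tables cover different year ranges, the two rate series need to be aligned by year before they are compared. A minimal sketch, assuming two hypothetical series with invented values:

```python
import pandas as pd

# Hypothetical accident rates for illustration only.
commercial = pd.Series({1983: 0.30, 1984: 0.20, 1985: 0.25}, name='Commercial')
general = pd.Series({1984: 8.0, 1985: 7.5, 1986: 7.0}, name='General aviation')

# Concatenating along columns aligns on the year index;
# years missing from one source show up as NaN.
both = pd.concat([commercial, general], axis=1)

# Keep only the years present in both series before comparing them.
overlap = both.dropna()
overlap = overlap.assign(ratio=overlap['General aviation'] / overlap['Commercial'])
print(overlap)
```

Restricting the comparison to overlapping years avoids drawing conclusions from a stretch where only one of the two lines has data.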
Finally, in looking at source 1, my most detailed data source, I can see more granular information for each accident. This contains the type of airplane for every data entry. To present to the reader what planes are responsible for the most crashes, I will create a pivot table that sums the number of accidents for each plane. Then, I will create a table for the reader that displays just the top 5 planes. Since I do not expect the reader to have a knowledge of all types of planes, I will either include an image or brief description of the plane.
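The pivot described above could look roughly like this. It is a sketch on hypothetical records; only the column names `Aircraft Make` and `Aircraft Model` come from source 1, and the rows are invented.

```python
import pandas as pd

# Hypothetical records for illustration; the real rows come from source 1.
events = pd.DataFrame({
    'Aircraft Make': ['CESSNA', 'PIPER', 'CESSNA', 'BEECH', 'CESSNA', 'PIPER'],
    'Aircraft Model': ['172', 'PA28', '172', '35', '150', 'PA28'],
})

# Count events per make/model pair and keep the five most frequent.
top_planes = (
    events.groupby(['Aircraft Make', 'Aircraft Model'])
          .size()
          .sort_values(ascending=False)
          .head(5)
          .rename('Accidents')
          .reset_index()
)
print(top_planes)
```

Raw counts like these are not normalized by how much each aircraft type flies, so a widely used model can top the list simply because there are more of them in service.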
A question a reader may ask when boarding a plane is: how experienced is my pilot? Data source 1, the detailed table, also contains a column with the number of hours of experience of the Pilot in Command (PIC) for most accidents. I will extract this column and create a histogram so that the reader can see the distribution of experience hours among pilots who have been in crashes, and judge for themselves whether the number of hours could be a factor in safety.
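The binning behind such a histogram can be sketched with NumPy. The hour values and bin edges below are hypothetical choices made for illustration, not values from the FAA data.

```python
import numpy as np

# Hypothetical pilot-in-command total hours, for illustration only.
pic_hours = np.array([120, 450, 800, 1500, 2300, 5200, 9800, 15000])

# Fixed bin edges keep the buckets easy to explain to a general reader.
bin_edges = [0, 500, 1000, 5000, 10000, 20000]
counts, _ = np.histogram(pic_hours, bins=bin_edges)

# Print one line per bucket: lower edge, upper edge, number of pilots.
for low, high, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f'{low:>6} to {high:<6} hours: {int(n)}')
```

Unevenly spaced edges are a deliberate choice here, since flight-hour totals tend to span several orders of magnitude.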
import pandas as pd
import os
import numpy as np
import copy
import matplotlib.pyplot as plt
from IPython.display import Image
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/data_raw
Data sources 2 and 3 both pertain to commercial airline data. Since there is quite a bit of overlap between the information in item 2 and 3, the first data processing task will be to join the two datasets.
Our final cleaned, combined file of airline data will be saved as data_clean/airline_aggregate.csv
# Read in accidents data and accidents+fatalities data
accidents_df = pd.read_csv('accidentsAccidentRates_scheduledPass.csv', sep=',', header=0) # source 2
accidents_fatal_df = pd.read_csv('accidentsFatalitiesRates_airlines.csv', sep=',', header=0) # source 3
# Look at columns in data set 2
accidents_df.head(1)
list(accidents_df.columns)
# Look at columns in data set 3
list(accidents_fatal_df.columns)
# Merge data sets 2 and 3
accidents_all_df = pd.merge(accidents_df, accidents_fatal_df, how = 'inner')
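When `on` is omitted, `pd.merge` joins on every column name the two frames share. A minimal sketch of a more explicit join follows, using hypothetical two-row extracts; whether `Year` alone is the right key for the real files is an assumption.

```python
import pandas as pd

# Hypothetical two-row extracts; column names follow the source tables,
# the values are invented.
severity = pd.DataFrame({'Year': [2013, 2014], 'Accidents, Major': [2, 1]})
fatal = pd.DataFrame({'Year': [2013, 2014], 'Fatalities, Total': [9, 0]})

# Naming the key makes the join explicit; validate raises a MergeError
# if either table has more than one row per year.
merged = pd.merge(severity, fatal, on='Year', how='inner', validate='one_to_one')
print(merged)
```

With an inner join, any year missing from either table is silently dropped, so comparing `len(merged)` with the input lengths is a cheap sanity check.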
For better understanding and readability, we will drop one redundant column and reorder the remaining columns so that similar data is grouped together.
# Drop a redundant column
accidents_all_df = accidents_all_df.drop(columns=['Aircraft Hours Flown (millions)'])
# Reorder columns
accidents_all_df_reordered = accidents_all_df[['Year',
'Illegal Act',
'Flight Hours',
'Miles Flown',
'Departures',
'Accidents, Major',
'Accidents, Serious',
'Accidents, Injury',
'Accidents, Damage',
'Accidents, All',
'Accidents, Fatal',
'Fatalities, Total',
'Fatalities, Aboard',
'Accidents per Million Hours Flown, Major',
'Accidents per Million Hours Flown, Serious',
'Accidents per Million Hours Flown, Injury',
'Accidents per Million Hours Flown, Damage',
'Accidents per 100,000 Flight Hours, All',
'Accidents per 100,000 Flight Hours, Fatal',
'Accidents per 1,000,000 Miles Flown, All',
'Accidents per 1,000,000 Miles Flown, Fatal',
'Accidents per 100,000 Departures, All',
'Accidents per 100,000 Departures, Fatal']]
accidents_all_df_reordered.head(10)
In this next step of data cleaning, we check the data types of all the columns in the data set and change them where needed to suit our computation.
accidents_all_df_reordered.dtypes
Not all of the items are the correct data type. All of the items that are objects should be changed to int64 with the exception of Illegal Act which is a string (Yes or No), and rates which should be float64.
# Remove commas from numbers which are currently string objects
accidents_all_df_no_comma = accidents_all_df_reordered.replace(',','', regex=True)
# Update types of columns
accidents_all_df_no_comma['Illegal Act'] = accidents_all_df_no_comma['Illegal Act'].astype(str)
accidents_all_df_no_comma['Flight Hours'] = accidents_all_df_no_comma['Flight Hours'].astype(int)
accidents_all_df_no_comma['Miles Flown'] = accidents_all_df_no_comma['Miles Flown'].astype(int)
accidents_all_df_no_comma['Departures'] = accidents_all_df_no_comma['Departures'].astype(int)
# Replace dash placeholders with NaN, then convert the remaining object columns to floats
fatal_rate_cols = ['Accidents per 100,000 Flight Hours, Fatal',
                   'Accidents per 1,000,000 Miles Flown, Fatal',
                   'Accidents per 100,000 Departures, Fatal']
for col in fatal_rate_cols:
    # Assign through .loc on the frame itself to avoid chained assignment;
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    dash_mask = accidents_all_df_no_comma[col].str.contains('-', na=False)
    accidents_all_df_no_comma.loc[dash_mask, col] = np.nan
    accidents_all_df_no_comma[col] = accidents_all_df_no_comma[col].astype(float)
accidents_all_clean = accidents_all_df_no_comma.copy()
accidents_all_clean.dtypes
accidents_all_clean = accidents_all_clean.set_index(accidents_all_df_no_comma['Year'])
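The comma stripping and dash handling above can also be folded into one helper. This is a sketch of an alternative, not the approach the notebook takes; `to_number` is a hypothetical name and the sample values are invented.

```python
import pandas as pd

def to_number(series):
    """Strip thousands separators and turn non-numeric placeholders into NaN."""
    stripped = series.astype(str).str.replace(',', '', regex=False)
    return pd.to_numeric(stripped, errors='coerce')

# Hypothetical raw values for illustration.
raw = pd.Series(['1,234,567', '-', '0.25'])
cleaned = to_number(raw)
print(cleaned)
```

One difference worth noting: `errors='coerce'` only blanks values that cannot be parsed, so a genuinely negative number would survive, whereas a `str.contains('-')` mask would treat it as a placeholder.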
Now let's save the file we have cleaned up to our clean data folder.
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/
accidents_all_clean.to_csv('data_clean/airline_aggregate.csv')
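Because `Year` is kept as both the index and a regular column, the saved CSV carries it twice. The sketch below uses a hypothetical two-row frame and an in-memory buffer to show what a later `read_csv` would see; `columns_after_round_trip` is an illustrative helper, not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical frame that keeps 'Year' as both the index and a column,
# mirroring the set_index call used above. Values are illustrative only.
df = pd.DataFrame({'Year': [2013, 2014], 'Departures': [9, 10]})
df = df.set_index(df['Year'])

def columns_after_round_trip(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV and return the columns read back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer).columns.tolist()

with_index = columns_after_round_trip(df)                  # index and column both written
without_index = columns_after_round_trip(df, index=False)  # column only
print(with_index)
print(without_index)
```

Passing `index=False` is one way to keep the saved file to a single `Year` column; reading the file back with `index_col` would be another.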
Data source 4 contains similar accident and fatality roll up data as sources 2 and 3, but instead of for commercial flights it is for all general aviation in the United States.
Our final cleaned file of general aviation data will be saved as data_clean/genav_aggregate.csv
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/data_raw
# Read in accidents data for all aviation in the US
genav_df = pd.read_csv('accidentsFatalitiesRates_genAv.csv', sep=',', header=0) # source 4
genav_df
# Fill any dashes with 0 so the columns can be converted to numeric types
for col in ['Flight Hours',
            'Accidents per 100,000 Flight Hour, All',
            'Accidents per 100,000 Flight Hour, Fatal']:
    # Assign through .loc on the frame itself to avoid chained assignment
    genav_df.loc[genav_df[col].str.contains('-', na=False), col] = 0
genav_df.dtypes
Not all of the items are the correct data type. All of the items that are objects should be changed to int64 with the exception of 'Accidents per 100,000' which should be float64.
# Remove commas from numbers which are currently string objects
genav_no_comma = genav_df.replace(',','', regex=True)
# Update some columns from objs to ints
genav_no_comma['Accidents, All'] = genav_no_comma['Accidents, All'].astype(int)
genav_no_comma['Fatalities, Total'] = genav_no_comma['Fatalities, Total'].astype(int)
genav_no_comma['Fatalities, Aboard'] = genav_no_comma['Fatalities, Aboard'].astype(int)
genav_no_comma['Flight Hours'] = genav_no_comma['Flight Hours'].astype(int)
# Update rate columns from objs to floats
genav_no_comma['Accidents per 100,000 Flight Hour, All'] = genav_no_comma['Accidents per 100,000 Flight Hour, All'].astype(float)
genav_no_comma['Accidents per 100,000 Flight Hour, Fatal'] = genav_no_comma['Accidents per 100,000 Flight Hour, Fatal'].astype(float)
# Change rows with zeros back to np.nan
genav_no_comma = genav_no_comma.replace(0,np.nan)
genav_clean = genav_no_comma.copy()
genav_clean.dtypes
genav_clean = genav_clean.set_index(genav_clean['Year'])
Now that we have finished cleaning the data, let's save the final product to our clean data folder.
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/
genav_clean.to_csv('data_clean/genav_aggregate.csv')
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/data_raw
# Read in the detailed FAA Accident and Incident Data System records
faa_aids = pd.read_csv('faaAccidentIncidentDataSystem.csv', sep=',', header=0) # source 1
# Check data types
faa_aids.dtypes
Since we will be using the Aircraft Make, Model, and Series columns, I will convert these to strings.
# Cast types as strings
faa_aids['Aircraft Make'] = faa_aids['Aircraft Make'].astype(str)
faa_aids['Aircraft Model'] = faa_aids['Aircraft Model'].astype(str)
faa_aids['Aircraft Series'] = faa_aids['Aircraft Series'].astype(str)
No more data processing is needed for this data set. There were no significant changes that merit saving another copy.
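One thing to watch when casting to strings: depending on the pandas version, `astype(str)` may turn a missing value into the literal text `'nan'`, which would then be counted as if it were an aircraft make. The sketch below, on hypothetical values, uses the nullable `'string'` dtype instead, which keeps missing entries marked as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration.
makes = pd.Series(['BOEING', np.nan, 'CESSNA'])

# The nullable string dtype keeps the missing entry as <NA>
as_string = makes.astype('string')
missing_mask = as_string.isna()

# Missing entries can then be dropped before counting makes
known_makes = as_string[~missing_mask]
print(missing_mask.tolist())
print(known_makes.tolist())
```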
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject
For this, I will use data sources 2 & 3 to plot the aggregate totals either as a line or bar chart. Source 2 will be for accidents and source 3 will be for fatalities. I will make one plot for fatalities and one plot for accidents. For accidents, I will either plot multiple lines in different colors to denote the different types of accident totals, or use a stacked bar chart. For fatalities I will also use a line or bar chart.
Below, we will plot the number of annual accidents as a simple line plot.
# Set up the plot
fig = plt.figure(1, figsize=(18, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot multiple lines
plt.plot(accidents_all_clean['Accidents, Major'], "--", color = 'orange', label = 'Accidents, Major')
plt.plot(accidents_all_clean['Accidents, Serious'], "--", color = 'green', label = 'Accidents, Serious')
plt.plot(accidents_all_clean['Accidents, Injury'], "--", color = 'blue', label = 'Accidents, Injury')
plt.plot(accidents_all_clean['Accidents, Damage'], "--", color = 'purple', label = 'Accidents, Damage')
plt.plot(accidents_all_clean['Accidents, Fatal'], "--", color = 'red', label = 'Accidents, Fatal')
plt.plot(accidents_all_clean['Accidents, All'], "-", color = 'black', label = 'Accidents, All')
# Add titles, axes labels, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents')
plt.legend(loc='upper left')
fig.savefig('results/accidents_lines.png')
plt.show()
| Term | Definition |
|---|---|
| Accident, Major | Major - an accident in which any of three conditions is met: 1) a Part 121 aircraft was destroyed, or 2) there were multiple fatalities, or 3) there was one fatality and a Part 121 aircraft was substantially damaged. |
| Accident, Serious | Serious - an accident in which at least one of two conditions is met: 1) there was one fatality without substantial damage to a Part 121 aircraft, 2) there was at least one serious injury and a Part 121 aircraft was substantially damaged. |
| Accident, Injury | Injury - a nonfatal accident with at least one serious injury and without substantial damage to a Part 121 aircraft. |
| Accident, Damage | Damage - an accident in which no person was killed or seriously injured, but in which any aircraft was substantially damaged. |
# Set color scheme for stacked bar chart
colors = ['#f53b11', '#f0701a', '#ed9134', '#f0b54f', '#f2d999']
# Plot all accident categories as stacked bars and keep the Axes that pandas returns,
# so the labels below are applied to this chart rather than to the previous figure
ax = accidents_all_clean.loc[:, ['Accidents, Fatal', 'Accidents, Major', 'Accidents, Serious', 'Accidents, Injury',
                                 'Accidents, Damage']].plot.bar(stacked=True, color=colors, figsize=(17, 8))
# Add title and axes labels
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents')
plt.savefig('results/accidents_bars.png')
plt.show()
# Set up the plot
fig = plt.figure(1, figsize=(18, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot both total fatalities and fatalities on board
plt.plot(accidents_all_clean['Fatalities, Total'], "-", color = 'black', label = 'Fatalities, Total')
plt.plot(accidents_all_clean['Fatalities, Aboard'], "--", color = 'green', label = 'Fatalities, Aboard')
# Add title, axes labels, and legend to plot
plt.title('Fatalities Resulting from Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
plt.xticks(accidents_all_clean['Year'])
ax.set_ylabel('Fatalities Due to Aviation')
plt.legend(loc='upper left')
fig.savefig('results/fatalities_lines.png')
plt.show()
For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross miles flown per year as a line or bar chart. This will bring to light the huge spike in air travel in the past 30 years. After this, I will plot accidents and fatalities as a rate per mile flown each year. This will show that, proportional to the increase in air travel over the years, the likelihood of being in a plane that is involved in an accident has gone down dramatically.
# Set up the plot
fig = plt.figure(1, figsize=(18, 5))
ax = fig.add_subplot(1, 1, 1)
# Plot departure data as a bar chart
plt.bar(accidents_all_clean['Year'], accidents_all_clean['Departures']/1000000, align='center', alpha=0.5, color='#f59bfa')
# Add legends, titles, axes labels
plt.xticks(accidents_all_clean['Year'])
plt.ylabel('Departures in Millions')
plt.title('Increase in Departures in the United States over Time')
ax.set_xlabel('Year 1983 - 2014')
fig.savefig('results/departures.png')
plt.show()
There is a clear increase in the number of overall departures over the period, with the steepest rise in the late 1990s.
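A visual impression like this can be backed with year-over-year growth figures. The sketch below uses hypothetical departure counts chosen only to illustrate the calculation; they are not values from the data set.

```python
import pandas as pd

# Hypothetical departure counts (millions), for illustration only.
departures = pd.Series({1995: 8.0, 1996: 8.4, 1997: 10.5, 1998: 10.92})

# Year-over-year percentage change; the first year has no predecessor.
yoy = departures.pct_change() * 100
print(yoy.round(1))

# Year with the largest single-year jump
print(yoy.idxmax())
```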
# Set up the plot
fig = plt.figure(1, figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Plot the miles flown data as a bar chart
plt.bar(accidents_all_clean['Year'], accidents_all_clean['Miles Flown']/1000000000, align='center', alpha=0.5, color='#28d48c')
# Add legends, titles, axes labels
plt.xticks(accidents_all_clean['Year'])
plt.ylabel('Miles Flown (Billions)')
plt.title('Increase in Miles Flown (in billions) in the United States over Time')
ax.set_xlabel('Year 1983 - 2014')
fig.savefig('results/miles.png')
plt.show()
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot accident data as a line plot
plt.plot(1000000 * accidents_all_clean['Accidents, All']/accidents_all_clean['Miles Flown'], "-", color = 'black', label = 'Accidents per Million Miles, All')
plt.plot(1000000 * accidents_all_clean['Accidents, Fatal']/accidents_all_clean['Miles Flown'], "--", color = 'red', label = 'Accidents per Million Miles, Fatal')
plt.plot(1000000 * accidents_all_clean['Accidents, Major']/accidents_all_clean['Miles Flown'], "--", color = 'orange', label = 'Accidents per Million Miles, Major')
plt.plot(1000000 * accidents_all_clean['Accidents, Serious']/accidents_all_clean['Miles Flown'], "--", color = 'green', label = 'Accidents per Million Miles, Serious')
plt.plot(1000000 * accidents_all_clean['Accidents, Injury']/accidents_all_clean['Miles Flown'], "--", color = 'blue', label = 'Accidents per Million Miles, Injury')
plt.plot(1000000 * accidents_all_clean['Accidents, Damage']/accidents_all_clean['Miles Flown'], "--", color = 'purple', label = 'Accidents per Million Miles, Damage')
# Add titles, legends, axes labels
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents per Million Miles Flown')
plt.legend(loc='upper right')
fig.savefig('results/accidents_miles_lines.png')
plt.show()
fatal_rate_df = 1000000 * accidents_all_clean['Accidents, Fatal']/accidents_all_clean['Miles Flown']
fatal_rate_df
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot the number of fatalities per million miles
plt.plot(1000000 * accidents_all_clean['Fatalities, Total']/accidents_all_clean['Miles Flown'], "-", color = 'black', label = 'Fatalities Per Million Miles, Total')
plt.plot(1000000 * accidents_all_clean['Fatalities, Aboard']/accidents_all_clean['Miles Flown'], "--", color = 'green', label = 'Fatalities Per Million Miles, Aboard')
plt.xticks(accidents_all_clean['Year'])
# Add titles, axes, and legend
plt.title('Fatalities Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Fatalities Per Million Miles Flown')
plt.legend(loc='upper left')
fig.savefig('results/fatalities_miles_lines.png')
plt.show()
For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross annual flight hours as a line or bar chart. This will bring to light the huge spike in air travel in the past 30 years. After this, I will plot accidents and fatalities as a rate per annual flight hour. This will show that, proportional to the increase in air travel over the years, the likelihood of being in a plane that is involved in an accident has gone down dramatically.
# Set up the plot
fig = plt.figure(1, figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Plot the number of flight hours
plt.bar(accidents_all_clean['Year'], accidents_all_clean['Flight Hours']/1000000, align='center', alpha=0.5)
# Add titles, axes, and legend
plt.xticks(accidents_all_clean['Year'])
plt.ylabel('Hours in Millions')
plt.title('Increase in Flight Hours in the United States over Time')
ax.set_xlabel('Year 1983 - 2014')
fig.savefig('results/hours_bars.png')
plt.show()
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot accident data per flight hours as a line plot
plt.plot(1000000*accidents_all_clean['Accidents, Major']/accidents_all_clean['Flight Hours'], "--", color = 'orange', label = 'Accidents Per Million Flight Hours, Major')
plt.plot(1000000*accidents_all_clean['Accidents, Fatal']/accidents_all_clean['Flight Hours'], "--", color = 'red', label = 'Accidents Per Million Flight Hours, Fatal')
plt.plot(1000000*accidents_all_clean['Accidents, All']/accidents_all_clean['Flight Hours'], "-", color = 'black', label = 'Accidents Per Million Flight Hours, All')
# Add titles, axes, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Accidents per Million Flight Hours')
plt.legend(loc='upper left')
fig.savefig('results/accidents_major_hours_lines.png')
plt.show()
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot accident data per flight hours as a line plot
plt.plot(1000000*accidents_all_clean['Accidents, Serious']/accidents_all_clean['Flight Hours'], "--", color = 'green', label = 'Accident Rate, Serious')
plt.plot(1000000*accidents_all_clean['Accidents, Injury']/accidents_all_clean['Flight Hours'], "--", color = 'blue', label = 'Accident Rate, Injury')
plt.plot(1000000*accidents_all_clean['Accidents, Damage']/accidents_all_clean['Flight Hours'], "--", color = 'purple', label = 'Accident Rate, Damage')
# Add titles, axes, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Accidents per Million Flight Hours')
plt.legend(loc='upper left')
fig.savefig('results/accidents_minor_hours_lines.png')
plt.show()
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot fatalities as a line plot
plt.plot(1000000*accidents_all_clean['Fatalities, Total']/accidents_all_clean['Flight Hours'], "-", color = 'black', label = 'Fatality Rate, Total')
plt.plot(1000000*accidents_all_clean['Fatalities, Aboard']/accidents_all_clean['Flight Hours'], "--", color = 'green', label = 'Fatality Rate, Aboard')
plt.xticks(accidents_all_clean['Year'])
# Add titles, axes, and legend
plt.title('Fatalities Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Fatalities Per Million Flight Hours')
plt.legend(loc='upper left')
fig.savefig('results/fatalities_hours_lines.png')
plt.show()
fatality_rate_df = 1000000*accidents_all_clean['Fatalities, Aboard']/accidents_all_clean['Flight Hours']
fatality_rate_df
After addressing parts 1a-1c for just commercial* flights, I turn to table 4, which contains the same accident, fatality, and rate data but for general aviation in the United States. I will use this to compare the relative safety of commercial flight vs general flight (including private and personal flights). I will choose just one metric - accidents per flight hour - and plot both the commercial and general aviation lines on the same plot in different colors.
genav_clean.head(5)
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot accident data as a line plot
plt.plot(accidents_all_clean['Accidents, All'], "-", color = 'blue', label = 'Commercial Accidents, All')
plt.plot(accidents_all_clean['Accidents, Fatal'], "--", color = 'blue', label = 'Commercial Accidents, Fatal')
plt.plot(genav_clean['Accidents, All'], "-", color = 'orange', label = 'General Aviation Accidents, All')
plt.plot(genav_clean['Accidents, Fatal'], "--", color = 'orange', label = 'General Aviation Accidents, Fatal')
# Add titles, axes, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents')
plt.legend(loc='upper right')
fig.savefig('results/public_private_accidents.png')
plt.show()
# Set up plot
fig = plt.figure(1, figsize=(14, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot number of fatalities for both commercial and general aviation
plt.plot(accidents_all_clean['Fatalities, Total'], "-", color = 'blue', label = 'Commercial Fatalities, Total')
plt.plot(accidents_all_clean['Fatalities, Aboard'], "--", color = 'blue', label = 'Commercial Fatalities, Aboard')
plt.plot(genav_clean['Fatalities, Total'], "-", color = 'orange', label = 'General Aviation Fatalities, Total')
plt.plot(genav_clean['Fatalities, Aboard'], "--", color = 'orange', label = 'General Aviation Fatalities, Aboard')
# Add titles, axes, and legend
plt.title('Aircraft Fatalities Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Fatalities')
plt.legend(loc='upper right')
fig.savefig('results/public_private_fatalities.png')
plt.show()
# Set years to be labels
labels = accidents_all_clean['Year'].values
# Make years to be the x axis
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
# Create bar plot
fig, ax = plt.subplots(figsize=(16, 8))
rects1 = ax.bar(x - width/2, accidents_all_clean['Flight Hours']/1000000, width, label='Commercial Flight Hours')
rects2 = ax.bar(x + width/2, genav_clean['Flight Hours'].loc[1983:2015]/1000000, width, label='General Aviation Flight Hours')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Flight Hours in Millions')
ax.set_xlabel('Years 1983 - 2014')
ax.set_title('Change in Annual Flight Hours for Commercial and General Flight')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
fig.savefig('results/public_private_hours.png')
plt.show()
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot accident data as a line plot respective to total number of flight hours
plt.plot(accidents_all_clean['Accidents, All']/(accidents_all_clean['Flight Hours']/1000000),
"-", color = 'blue', label = 'Commercial Accidents per Million Hours, All')
plt.plot(accidents_all_clean['Accidents, Fatal']/(accidents_all_clean['Flight Hours']/1000000),
"--", color = 'blue', label = 'Commercial Accidents per Million Hours, Fatal')
plt.plot(genav_clean['Accidents, All']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000),
"-", color = 'orange', label = 'General Aviation Accidents per Million Hours, All')
plt.plot(genav_clean['Accidents, Fatal']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000),
"--", color = 'orange', label = 'General Aviation Accidents per Million Hours, Fatal')
# Add some text for labels, title and custom x-axis tick labels, etc.
plt.title('Aircraft Accidents Per Million Flight Hours in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Accidents Per Million Flight Hours')
plt.legend(loc='upper right')
fig.savefig('results/public_private_accidents_hours.png')
plt.show()
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot fatality rates data as a line plot
plt.plot(accidents_all_clean['Fatalities, Total']/(accidents_all_clean['Flight Hours']/1000000),
"-", color = 'blue', label = 'Commercial Fatalities, Total')
plt.plot(accidents_all_clean['Fatalities, Aboard']/(accidents_all_clean['Flight Hours']/1000000),
"--", color = 'blue', label = 'Commercial Fatalities, Aboard')
plt.plot(genav_clean['Fatalities, Total']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000),
"-", color = 'orange', label = 'General Aviation Fatalities, Total')
plt.plot(genav_clean['Fatalities, Aboard']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000),
"--", color = 'orange', label = 'General Aviation Fatalities, Aboard')
# Add some text for labels, title and custom x-axis tick labels, etc.
plt.title('Aircraft Fatalities Per Million Hours Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Fatalities Per Million Flight Hours')
plt.legend(loc='upper right')
fig.savefig('results/public_private_fatalities_hours.png')
plt.show()
Finally, in looking at source 1, my most detailed data source, I can see more granular information for each accident. This contains the type of airplane for every data entry. To show the reader which planes appear in the most crash reports, I will create a pivot table that counts the number of accidents for each plane. Then, I will create a table for the reader that displays just the top 5 planes. Since I do not expect the reader to know all types of planes, I will include either an image or a brief description of each plane.
This methodology was further refined into two sub-questions. The first looked at the most common plane among all incidents; the second looked at the most common plane in fatal accidents.
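The counting step can also be sketched with `Series.value_counts`, which gives the same kind of table as a group-by count. The records below are toy values for illustration, not rows from the FAA AIDS data, and this is an alternative sketch rather than the approach used in the cells that follow.

```python
import pandas as pd

# Toy incident records; values are illustrative only
toy = pd.DataFrame({
    "Aircraft Make": ["CESSNA", "CESSNA", "PIPER", "CESSNA", None],
    "Aircraft Model": ["172", "172", "PA28", "182", "727"],
})

# Drop rows with a missing make or model before building the combined label
complete = toy.dropna(subset=["Aircraft Make", "Aircraft Model"])
make_model = complete["Aircraft Make"] + " " + complete["Aircraft Model"]

# value_counts sorts by frequency, most common first
top = make_model.value_counts().head(1)
print(top)
```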
faa_aids.columns
# Next, we essentially create a pivot table of the make/model of the airplanes
plane_type_subset = faa_aids[['AIDS Report Number', 'Aircraft Make','Aircraft Model']].copy()
plane_type_subset['Aircraft Make Model'] = plane_type_subset['Aircraft Make'] + ' ' + plane_type_subset['Aircraft Model']
# Drop rows with no make/model data
plane_type_subset = plane_type_subset.replace('nan', np.nan)
plane_type_subset_clean = plane_type_subset.dropna()
# Create a count table
type_count = plane_type_subset_clean.groupby(['Aircraft Make Model']).count()
# Drop unused columns
type_count = type_count.drop(columns=["Aircraft Make","Aircraft Model"])
# Rename one column
type_count = type_count.rename(columns={"AIDS Report Number": "Number of Incident Reports"})
# Sort and display by top count
df = type_count.sort_values(by =['Number of Incident Reports'], ascending = False).head(5)
df
df.to_csv('results/incidents_by_make_model.csv')
## Cessna 172
Image(filename='results/Cessna172S.jpg')
## Piper PA-28
Image(filename='results/PA-28.jpg')
## Boeing 727
Image(filename='results/B-727.jpg')
## Cessna 210
Image(filename='results/Cessna210.jpg')
## Mooney M-20
Image(filename='results/M-20.jpg')
For the next analysis we will look at the aircraft with the most fatal accidents.
# Next, we essentially create a pivot table of the make/model of the airplanes
fatal_plane_type_subset = faa_aids[['AIDS Report Number', 'Aircraft Make','Aircraft Model',
'Total Fatalities']].copy()
fatal_plane_type_subset['Aircraft Make Model'] = fatal_plane_type_subset['Aircraft Make'] + ' ' + fatal_plane_type_subset['Aircraft Model']
# Drop rows with no make/model data
fatal_plane_type_subset = fatal_plane_type_subset.replace('nan', np.nan)
fatal_plane_type_subset_clean = fatal_plane_type_subset.dropna()
# Drop items with zero fatalities
fatal_plane_type_subset_clean = fatal_plane_type_subset_clean[fatal_plane_type_subset_clean['Total Fatalities'] > 0]
# Group by make/model, Sort by largest count
fatal_type_count = fatal_plane_type_subset_clean.groupby(['Aircraft Make Model']).count()
# Rename columns
fatal_type_count = fatal_type_count.rename(columns={"AIDS Report Number": "Number of Incident Reports"})
# Drop unused columns
fatal_type_count = fatal_type_count.drop(columns=["Aircraft Make","Aircraft Model","Total Fatalities"])
# Sort and display top 5 by largest count
df = fatal_type_count.sort_values(by =['Number of Incident Reports'], ascending = False).head(5)
df
df.to_csv('results/fatalities_by_make_model.csv')
## Cessna 182
Image(filename='results/Cessna182.jpg')
## DHC
Image(filename='results/DHC-6.jpg')
## Cessna 180
Image(filename='results/Cessna180.jpg')
## Cessna 206
Image(filename='results/Cessna206.jpg')
## DC-3
Image(filename='results/DC-3.jpg')
A question a reader may ask when boarding a plane is - how experienced is my pilot? Data source 1, the detailed table, also contains a column with the number of hours of experience of the Pilot in Command (PIC) for most accidents. I will extract this column and create a histogram so that the reader can see the distribution of experience hours among pilots who have been in crashes, and judge for themselves whether the number of hours could be a factor in safety.
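Before plotting, the binning behind a histogram can be sketched with `numpy.histogram`. The hours and bin edges below are made up for illustration; explicit edges are just one option and are not what the plots in the following cells use (they use 100 equal-width bins).

```python
import numpy as np

# Toy pilot flight hours; illustrative values only
hours = np.array([50, 120, 300, 450, 800, 1500, 4000, 12000])

# Explicit bin edges (an assumption for this sketch);
# each bin is half-open except the last, which includes its right edge
edges = [0, 500, 1000, 5000, 20000]
counts, _ = np.histogram(hours, bins=edges)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>6} - {hi:<6} hours: {n} reports")
```

Note that counts like these describe only pilots who appear in accident reports; they say nothing about how many pilots at each experience level were flying overall.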
# Next, we subset the data to just the columns of interest
PIC_subset = faa_aids[['AIDS Report Number', 'Aircraft Damage', 'Total Fatalities', 'PIC Certificate Type',
'PIC Flight Time Total Hrs', 'PIC Flight Time Total Make-Model',
'Aircraft Make', 'Aircraft Model']].copy()
PIC_subset['Aircraft Make Model'] = PIC_subset['Aircraft Make'] + ' ' + PIC_subset['Aircraft Model']
# Drop rows which have NA for any of the 3 PIC columns of interest
PIC_subset = PIC_subset.dropna(subset=['PIC Certificate Type', 'PIC Flight Time Total Hrs', 'PIC Flight Time Total Make-Model'])
# Group by PIC certificate type, Sort by largest count
PIC_cert_count = PIC_subset.groupby(['PIC Certificate Type']).count()
# Drop unused columns
PIC_cert_count = PIC_cert_count.drop(columns=["Aircraft Make", "Aircraft Model", "Total Fatalities", "Aircraft Damage", "PIC Flight Time Total Hrs", "PIC Flight Time Total Make-Model", "Aircraft Make Model"])
# Rename column
PIC_cert_count = PIC_cert_count.rename(columns={"AIDS Report Number": "Number of Incident Reports"})
# Sort and display by largest count
df = PIC_cert_count.sort_values(by =['Number of Incident Reports'], ascending = False).head(10)
df
df.to_csv('results/incidents_by_PIC_type.csv')
# Look at a histogram of the distributions of pilots experience
# Experience here is measured by total number of flight hours
# Set up the plot
fig = plt.figure(1, figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)
# Create histogram with data
plt.hist(PIC_subset['PIC Flight Time Total Hrs'], 100)
# Legends, titles, axes labels
plt.title('Flight Time Experience of Pilots in Accidents')
ax.set_ylabel('Number of Incidents')
ax.set_xlabel('Hours of Experience')
plt.show()
# Look at a histogram of the distributions of pilots experience
# Experience here is measured by total number of flight hours in make/model
# Set up the plot
fig = plt.figure(1, figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)
# Legends, titles, axes labels
plt.title('Flight Time Experience on Make/Model of Pilots in Accidents')
ax.set_ylabel('Number of Incidents')
ax.set_xlabel('Hours of Experience on Model')
# Create histogram with data
plt.hist(PIC_subset['PIC Flight Time Total Make-Model'], 100)
plt.show()
# Look at a histogram of the distributions of pilots experience
# Experience here is measured by total number of flight hours
# Set up the plot
fig, (ax1, ax2) = plt.subplots(2, figsize = (10,8))
fig.suptitle('Experience Level of Pilots in Accidents')
ax1.set_xlabel('Hours of Experience')
ax1.set_ylabel('Number of Accidents Reported')
ax2.set_xlabel('Hours of Experience on Make/Model')
ax2.set_ylabel('Number of Accidents Reported')
# Create histogram with data
ax1.hist(PIC_subset['PIC Flight Time Total Hrs'], 100)
ax2.hist(PIC_subset['PIC Flight Time Total Make-Model'], 100)
fig.savefig('results/PIC_experience_hours.png')
plt.show()
To summarize some of the main findings of this analysis, the total number of accidents and fatalities has not changed much over time. However, the quantity of air travel has grown a lot over the same period. So, relative to the increase in air travel, flight has become safer.
When comparing public flight and private flight, private flight clearly appears to be more dangerous, not just in grand totals but also relative to the number of flight hours. The next natural question is - why would this be? I have a couple of hypotheses. Some of the following research questions help to answer this.
We saw that most of the fatal crashes involved quite small planes (Cessnas), with a few exceptions of some more complicated twin engine airplanes. We also saw the importance of the pilot's experience as a contributing factor to safety. The pilot's experience on that specific make/model of airplane turned out to be even more important than the pilot's experience in general.
One conclusion I will draw is that larger planes (which are used for commercial flights) end up being safer. This is because the larger the plane is, the more flight hours the FAA requires for a pilot to be certified. The more flight hours they have, the more experienced the pilot is. As we saw, the more experienced the pilot, the less likely you are to be in a fatal crash, or any crash at all. This is where regulation comes into play - the FAA has different requirements to be licensed to fly each different type of aircraft. A pilot's license on one aircraft is not automatically transferable to another. We can see this is for a good reason.
Another conclusion I want to explore is why sheer numbers make it improbable that a commercial airliner would get into an accident. The larger a plane is, the more expensive it is. An expensive plane is unlikely to be privately owned, and fewer of them probably exist in general because few companies can afford them. Since there are fewer large planes, there are fewer chances for them to crash. This is why the visuals all brought attention to the fact that private flight appeared a lot more dangerous than commercial flight.
However, another point to note is that on average, commercial flight was safer (using fatalities per million miles as the safety metric), but for years where there was a commercial accident, the death tolls were around the same. This is because there could be 30 private flight accidents, but each likely won't have more than 4 people on board. Just one commercial airline accident could significantly change the outlook of these numbers. This is why the safety of commercial flight is so much more consequential and so much more widely publicized than accidents in the private sector. There are simply more lives at risk in one large passenger plane.
All in all, there is one final connection I want to make. Some of the reasons I attributed for the relative safety of commercial flight over private flight were:
I'd like to draw a parallel to another mode of transportation: cars. It's not a secret that driving is a lot more dangerous than flying. About the same number of people die every month in the United States from automobile accidents as were killed in the 9/11 terrorist attacks. I think what we learned in this analysis is applicable to car safety as well. I would suspect some of the key reasons driving is less safe are that cars are very cheap in comparison, everyone has them, and there are few requirements to be certified and given a driver's license. I would be interested to see the same plot with drivers' hours of experience behind the wheel. What has helped aviation in this quest for safety has been tighter regulation around requirements for certification. Perhaps something to consider for future discussion is the requirements for getting a driver's license.
One limitation of this data set is that it only covers 1983-2014. 2014 is not very recent, but unfortunately the FAA releases its cleaned data in 5 year increments, from what I read. Nonetheless, newer data would be useful to check that the trends still hold for the past 5 years. Obviously it would have been more interesting and useful to readers if this analysis included this past year's data. Another limitation is that many of the data sources are in a rolled-up aggregate format. This is convenient but can sometimes lead to some ambiguity as to what all the different categories are. Sorting out the distinctions and ensuring there is no double counting can require digging into a lot of aviation jargon.
Throughout this process - from choosing a topic, to choosing data, to creating a code repo - I utilized many of the principles of human centered data science I learned in this class. The topic naturally came to me, since I work at Boeing. I get asked probably once a week "what's it like working at Boeing right now?". The truth is - our day to day work as engineers not on the 737Max program is the same. This is a sign of good upper management if you ask me - that's not my job to deal with! But the reality is, the 737Max issue has gotten political. As soon as something gets political, the journalism quality inherently goes down and fact finding becomes difficult. I have seen this happen with the Max and it is frustrating to watch. I think many others share my frustration with the media taking hold of any hot topic it can and beating it to death. I truly think an objective, data driven analysis of the facts of airplane safety would be useful for the general public right now. I know this is something I would like to see as an outsider. Ultimately, you want to arm the general public with honest, truthful information so we can start to overcome the wave of #fakenews from 2016-2018.
I thought carefully about the ethics of this before I posed my questions. Is there a chance this could slander or degrade Boeing? Could it slander another company? If it is just the numbers, is it really even slander? I ultimately determined that as long as I focused just on the United States (the only region for which I had good data), I could maintain the integrity of the analysis. Since the FAA releases all of this data, there is nothing I could find that the FAA probably hasn't already unearthed. Not to mention, anyone else could find (and might already have found) this data and posted their findings all over the Huffington Post for the world to see. I had some confidence that the results I found would not be shocking findings for the aerospace industry, but I tried not to let this influence my analysis.
That was of course another factor - did I think that I could be unbiased, as an employee of the Boeing company? To eliminate biases, I tried to be thoughtful about my own research questions. I did not look at differences between Boeing and Airbus, and I did not focus on any specific plane. Instead, I tried to keep my investigation to comparing the differences between private and commercial flight at large.
Because of the nature of this data, and the danger of "predictions" when it comes to people's safety, I decided not to pursue any sort of predictive models for my analysis or research questions. Unless you are trying to predict a crash, there really isn't too much you can classify and predict in this field. Even in the case of predicting an accident, any model created would have a very high accuracy only because accidents are so infrequent. When an accident actually did occur, it would be unlikely the model would have performed correctly so false negatives would be high. For this reason, I just took a descriptive data visualization approach. I'd like for readers to be able to make connections for themselves, with the proper visualizations, but don't want to be prescriptive about the certainty of any individual factor as there can be so many confounding factors such as age of the plane, maintenance, and crew which cannot be accounted for in this data.
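The point about accuracy can be illustrated with a small sketch. The counts below are hypothetical, chosen only to mimic a very rare event; the "model" is a trivial rule that never predicts an accident, not anything fitted to the FAA data.

```python
# Hypothetical numbers for illustration only: 1,000,000 flights, 5 accidents
n_flights = 1_000_000
n_accidents = 5

actual = [1] * n_accidents + [0] * (n_flights - n_accidents)
predicted = [0] * n_flights  # trivial rule: never predict an accident

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

accuracy = correct / n_flights
recall = true_pos / (true_pos + false_neg)  # share of real accidents caught

print(f"accuracy = {accuracy:.6f}, recall = {recall:.1f}, false negatives = {false_neg}")
```

In this toy setup the rule scores above 99.99% accuracy while missing every accident, which is why accuracy alone would be a misleading measure here.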
For the creation of my code repository, I followed the best practices I learned from this class on data science reproducibility. I have included all my code, my source data, my cleaned data, cited my sources, provided licenses where appropriate, and noted limitations and disclaimers where appropriate. My final repo is well organized with a couple folders, and a README explaining each folder and file.
Interpretability is another key concept covered in human-centered data science. In this course, the focus was more on algorithmic interpretability. Since that was not necessarily relevant for my analysis, I instead focused on visualization interpretability. Thankfully, last quarter I took the data visualization course where we often focused on the integrity of visualizations. Does the visualization accurately communicate the data to the user, or does it distort the data to achieve a certain end? Certainly, ethics is a question to ask when creating visualizations as well. What you choose to display, what dimensions you use, and what data you use can all distort the results you are presenting to the reader. With the intention of human centered integrity, I tried to remember the tenets of that course as well and choose appropriate and diverse visualizations to present to the readers, accompanied by an analysis. All of the final visualizations are available in a folder in the repo called "results". It contains all the plots saved as pngs.
Overall, I really enjoyed being able to choose my own topic for this analysis. I could really pair it with my passions. Ultimately, I am proud of what I put together - I believe it is thorough, detailed, reproducible, well documented, and truthful. I am excited to show this repo to folks that I work with!
For these data sets, accidents which were caused by an illegal act such as suicide, sabotage, or terrorism were NOT included in rate calculations. However, they are factored into the overall accident and fatality numbers. For example, the 9/11 2001 terrorism fatalities and accidents are included but do not affect the rate calculations for accidents per million miles flown or fatalities per million miles flown. Additionally, only those killed on board the planes in the 9/11 terrorist attack are included in the fatality numbers. Unless otherwise stated, all fatalities are on-board fatalities. Information on acts of suicide, sabotage, or terrorism on 14 CFR 121 flights is available elsewhere.
There are some inconsistencies in the data as to which year is the last complete year of data. In general, all data stops at 2015 because the FAA releases a complete, verified set of data every 5 years (the next big release is 2020). Some data is incomplete for 2015 and some data is incomplete for 2014.
[1] https://www.govinfo.gov/app/details/CFR-2011-title14-vol3/CFR-2011-title14-vol3-part121/context
[2] https://catalog.data.gov/dataset/air-carrier-occurrences-involving-illegal-acts-sabotage-suicide-or-terrorism-1995-through-
[3] https://www.sbs.com.au/news/how-safe-is-flying-here-s-what-the-statistics-say
[4] https://www.anxieties.com/flying-howsafe.php
[5] https://www.bloomberg.com/news/articles/2019-05-30/flying-has-become-more-dangerous-don-t-just-blame-boeing
[6] https://en.wikipedia.org/wiki/De_Havilland_Canada_DHC-6_Twin_Otter#/media/File:WinAir_De_Havilland_Canada_DHC-6-300_Twin_Otter_Breidenstein.jpg
[7] https://en.wikipedia.org/wiki/Douglas_DC-3#/media/File:Douglas_DC-3,_SE-CFP.jpg
[8] https://en.wikipedia.org/wiki/File:Cessna182t_skylane_n2231f_cotswoldairshow_2010_arp.jpg
[9] https://en.wikipedia.org/wiki/Cessna_180#/media/File:Cessna.180a.g-btsm.arp.jpg
[10] https://en.wikipedia.org/wiki/Cessna_206#/media/File:Cessna.206h.stationair2.arp.jpg
[11] https://en.wikipedia.org/wiki/Cessna_172#/media/File:Cessna_172S_Skyhawk_SP,_Private_JP6817606.jpg
[12] https://en.wikipedia.org/wiki/Boeing_727#/media/File:B-727_Iberia_(cropped).jpg
[13] https://en.wikipedia.org/wiki/Piper_PA-28_Cherokee#/media/File:PiperPA-28-236DakotaC-GGFSPhoto4.JPG
[14] https://en.wikipedia.org/wiki/Cessna_210#/media/File:Cessna.210.centurion.d-ebws.arp.jpg
[15] https://en.wikipedia.org/wiki/Mooney_M20#/media/File:Mooney.m20j.g-muni.arp.jpg